Building RNN, LSTM, and GRU for time series using PyTorch

In this notebook, I’d like to give you a bit of an introduction to some of the RNN structures, such as RNN, LSTM, and GRU, and help you get started building your deep learning models for time-series forecasting using PyTorch.

Due to PyTorch’s recency, it has been somewhat difficult for me to find the relevant pieces of information and code samples from the get-go, which is usually a bit easier with frameworks that have been around for a while, say TensorFlow. So, I decided to put together the things I would have liked to know earlier.

Importing libraries

import torch
import torch.nn as nn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"{device} is available.")

Loading the dataset

Well, I suppose we need some time-series data to start with. Be it payment transactions or stock exchange data, time-series data is everywhere. One such public dataset is PJM’s Hourly Energy Consumption data, a univariate time-series dataset of 10+ years of hourly observations collected from different US regions. I’ll be using the PJM East region data, which originally has the hourly energy consumption data from 2001 to 2018, but any of the datasets provided in the link should work.

The following cell pops up an upload widget, so that you can upload your data to this notebook.

from google.colab import files
data_to_load = files.upload()

If you’d like to work with a dataset other than PJME_hourly.csv, you can change the field in the next cell accordingly.

import io
df = pd.read_csv(io.BytesIO(data_to_load['PJME_hourly.csv']))
import plotly.graph_objs as go
from plotly.offline import iplot

def plot_dataset(df, title):
    data = []
    
    value = go.Scatter(
        x=df.index,
        y=df.value,
        mode="lines",
        name="values",
        marker=dict(),
        text=df.index,
        line=dict(color="rgba(0,0,0, 0.3)"),
    )
    data.append(value)

    layout = dict(
        title=title,
        xaxis=dict(title="Date", ticklen=5, zeroline=False),
        yaxis=dict(title="Value", ticklen=5, zeroline=False),
    )

    fig = dict(data=data, layout=layout)
    iplot(fig)
    
df = df.set_index(['Datetime'])
df = df.rename(columns={'PJME_MW': 'value'})

df.index = pd.to_datetime(df.index)
if not df.index.is_monotonic_increasing:
    df = df.sort_index()
    
plot_dataset(df, title='PJM East (PJME) Region: estimated energy consumption in Megawatts (MW)')
    

The next step is to generate feature columns that transform our univariate dataset into a multivariate one; we will convert this time series into a supervised learning problem, if you will. In some datasets, features such as hourly temperature, humidity, or precipitation are readily available. In ours, however, no extra information that could help us predict the energy consumption is given. So it falls to us to create such predictors, i.e., feature columns.

I’ll show you two popular ways to generate features: passing lagged observations as features and creating date time features from the DateTime index. Both approaches have their advantages and disadvantages, and each may prove more useful depending on the task at hand.

Generating time-lagged observations

Let’s start with using time steps as features. In other words, we’re trying to predict the value at time t from the previous n observations X(t-1), X(t-2), …, X(t-n). What we need to do, then, is simply create n columns holding the preceding observations. Luckily, Pandas provides the shift() method for shifting the values in a column. So, we can write a for loop that creates such lagged observations by shifting the value column 1 to n times, and then drop the first n rows, which now contain missing values.

After setting the number of input features, i.e., lagged observations, to 100, we get the following DataFrame with 101 columns, one for the actual value, and the rest for the preceding 100 observations at each row.

def generate_time_lags(df, n_lags):
    df_n = df.copy()
    for n in range(1, n_lags + 1):
        df_n[f"lag{n}"] = df_n["value"].shift(n)
    df_n = df_n.iloc[n_lags:]
    return df_n

input_dim = 100

df_timelags = generate_time_lags(df, input_dim)
df_timelags
(Output: a DataFrame of 145266 rows × 101 columns, indexed by Datetime: the value column plus lag1 through lag100, each row holding the preceding 100 hourly observations.)

Generating date/time predictors

Despite its name, feature engineering is generally more art than science. Nonetheless, there are some rules of thumb that can guide data scientists. My goal in this section is not to go through all such practices, but to demonstrate a couple of them for you to experiment with on your own. In practice, feature engineering depends heavily on the domain you’re working in and may require a different set of features for each task at hand.

Having a univariate time-series dataset at hand, it seems only logical to start by generating date and time features. As we have already converted the index of the dataset into Pandas’ DatetimeIndex, a series of DateTime objects, we can easily create new features from the index values, such as hour, day, month, day of the week, and week of the year, as follows.

df_features = (
                df
                .assign(hour = df.index.hour)
                .assign(day = df.index.day)
                .assign(month = df.index.month)
                .assign(day_of_week = df.index.dayofweek)
                .assign(week_of_year = df.index.isocalendar().week)
              )

Although passing date and time features to the model untouched may work in practice, it makes it harder for the model to learn the interdependencies between these features. For us humans, it is straightforward to see that hours, days, weeks, and months follow cyclical patterns. While it is trivial for us to say that December is followed by January, it may not be clear to an algorithm that the first month of the year comes after the 12th. One can easily come up with many more examples. This makes good feature engineering crucial for building deep learning models, and even more so for traditional machine learning models.

One-hot encoding

One way to encode DateTime features is to treat them as categorical variables and add a new binary variable for each unique value, widely known as one-hot encoding. Suppose you applied one-hot encoding to your month column, which ranges from 1 to 12. Twelve new month columns are created, say [Jan, Feb, …, Dec], and exactly one of them holds the value 1 while the rest are zero. For instance, a DateTime value from February would have the second of these encoded columns set to 1, as in [0, 1, …, 0]. Using Pandas’ get_dummies method, we can easily create one-hot encoded columns from a given dataset.

def onehot_encode_pd(df, cols):
    for col in cols:
        dummies = pd.get_dummies(df[col], prefix=col)
        df = pd.concat([df, dummies], axis=1)

    return df.drop(columns=cols)

df_features = onehot_encode_pd(df_features, ['month','day','day_of_week','week_of_year'])
df_features.columns
(Output: df_features.columns, an Index listing the value and hour columns together with the newly created one-hot encoded columns.)

Though quite useful to encode categorical features, one-hot encoding does not fully capture the cyclical patterns in DateTime features. It simply creates categorical buckets, if you will, and lets the model learn from these seemingly independent features. Encoding the day of the week in a similar manner, for instance, loses the information that Monday is closer to Tuesday than Wednesday.
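To make that point concrete, here is a small standalone check (toy data, not part of the notebook's pipeline) showing that one-hot encoding places every pair of days at exactly the same distance:

```python
import numpy as np
import pandas as pd

days = pd.Series(["Mon", "Tue", "Wed"])
onehot = pd.get_dummies(days)[["Mon", "Tue", "Wed"]].to_numpy(dtype=float)

d_mon_tue = np.linalg.norm(onehot[0] - onehot[1])
d_mon_wed = np.linalg.norm(onehot[0] - onehot[2])
# Both distances are sqrt(2): one-hot treats every pair of days as equally far apart
```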

For some use cases, this may not matter too much, indeed. In fact, with enough data, training time, and model complexity, the model may learn such relationships between such features independently. But there is also another way.

Generating cyclical features (sin/cos transformation)

Some data is inherently cyclical: hours, days, weeks, and months all follow periodic cycles. Again, this is trivial for us to see, but not so much for machine learning models. The problem simply becomes: how can we tell an algorithm that hour 23 and hour 0 are as close to each other as hour 1 is to hour 2?

The gist is to create two new features by computing the sine and cosine transforms of a given DateTime feature, say the hour of the day. Instead of the original hour value, the model then uses its sine and cosine, which preserve the feature’s cyclicality. To see how and why this works, feel free to have a look at Pierre-Louis’ or David’s blog post on the matter, which explain the concept in more detail.
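As a quick numeric sanity check of that claim, here is a standalone sketch (independent of the notebook's data) comparing distances between hours after the sin/cos transform:

```python
import numpy as np

def encode_hour(hour, period=24):
    # Map an hour onto a point on the unit circle
    angle = 2 * np.pi * hour / period
    return np.array([np.sin(angle), np.cos(angle)])

d_23_0 = np.linalg.norm(encode_hour(23) - encode_hour(0))
d_1_2 = np.linalg.norm(encode_hour(1) - encode_hour(2))
# Both distances are the same (about 0.261): hour 23 is now as close to hour 0
# as hour 1 is to hour 2, which the raw integer encoding could not express
```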

def generate_cyclical_features(df, col_name, period, start_num=0):
    kwargs = {
        f'sin_{col_name}': lambda x: np.sin(2 * np.pi * (x[col_name] - start_num) / period),
        f'cos_{col_name}': lambda x: np.cos(2 * np.pi * (x[col_name] - start_num) / period),
    }
    return df.assign(**kwargs).drop(columns=[col_name])

df_features = generate_cyclical_features(df_features, 'hour', 24, 0)
# df_features = generate_cyclical_features(df_features, 'day_of_week', 7, 0)
# df_features = generate_cyclical_features(df_features, 'month', 12, 1)
# df_features = generate_cyclical_features(df_features, 'week_of_year', 52, 0)

df_features.head()
(Output: the first five rows, showing the value column, the one-hot week_of_year columns, and the new sin_hour and cos_hour columns; e.g., hour 1 maps to sin_hour 0.258819 and cos_hour 0.965926.)

Splitting the data into train, validation, and test sets

After creating feature columns, be it time-lagged observations or date/time features, we split the dataset into three different datasets: training, validation, and test sets.

from sklearn.model_selection import train_test_split

def feature_label_split(df, target_col):
    y = df[[target_col]]
    X = df.drop(columns=[target_col])
    return X, y

def train_val_test_split(df, target_col, test_ratio):
    val_ratio = test_ratio / (1 - test_ratio)
    X, y = feature_label_split(df, target_col)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_ratio, shuffle=False)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=val_ratio, shuffle=False)
    return X_train, X_val, X_test, y_train, y_val, y_test

X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(df_features, 'value', 0.2)

Applying scale transformation

Scaling the values in your dataset is a highly recommended practice for neural networks, as it is for other machine learning techniques. It speeds up learning by keeping the inputs in comparable ranges, which makes the weight updates better behaved. You can easily do that with Scikit-learn’s scalers: MinMaxScaler, RobustScaler, StandardScaler, and the like. For more information on the effects of each scaler, please refer to the official documentation.

And, here is a cool trick if you’re looking for a way to switch between scalers quickly. Get yourself comfortable with the switcher function; we may use it again later on.

from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler, RobustScaler

def get_scaler(scaler):
    scalers = {
        "minmax": MinMaxScaler,
        "standard": StandardScaler,
        "maxabs": MaxAbsScaler,
        "robust": RobustScaler,
    }
    return scalers.get(scaler.lower())()
scaler = get_scaler('minmax')
X_train_arr = scaler.fit_transform(X_train)
X_val_arr = scaler.transform(X_val)
X_test_arr = scaler.transform(X_test)

# Note that fit_transform here refits the same scaler object, this time on the target
y_train_arr = scaler.fit_transform(y_train)
y_val_arr = scaler.transform(y_val)
y_test_arr = scaler.transform(y_test)
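Because the last fit_transform call above fits the scaler on the target, inverse_transform can later map scaled predictions back to megawatts. Here is a standalone toy round trip with MinMaxScaler (the values below are made up, not taken from the PJME data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

y_toy = np.array([[20000.0], [30000.0], [40000.0]])  # made-up MW values
sc = MinMaxScaler()
y_scaled = sc.fit_transform(y_toy)           # maps min -> 0 and max -> 1
y_restored = sc.inverse_transform(y_scaled)  # recovers the original values
```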

Loading the data into DataLoaders

After you standardize your data, you are usually good to go. Not so fast, this time. After spending quite some time working with PyTorch and going through others’ code on the internet, I noticed most people ended up doing the matrix operations for mini-batch training, i.e., slicing the data into smaller batches, using NumPy. You may think that’s what NumPy is for; I get it. But there is also a more elegant PyTorch way of doing it, which certainly gets much less attention than it should, in my opinion.

PyTorch’s DataLoader class, a Python iterable over a Dataset, loads the data and splits it into batches for mini-batch training. The most important argument of the DataLoader constructor is the dataset object to load the data from. There are mainly two types of datasets: map-style and iterable-style.

In this tutorial, I’ll use the former, but feel free to check both out in the official documentation. It is also possible to write your own Dataset or DataLoader class to fit your requirements, but that’s beyond the scope of this post, as the built-in classes more than suffice. Here’s a link to the official tutorial on the topic.

For now, I’ll be using TensorDataset, a Dataset that wraps tensors. Since Scikit-learn’s scalers output NumPy arrays, I need to convert them into Torch tensors before loading them into TensorDatasets. After creating a TensorDataset for each split, I’ll use them to create my DataLoaders. You may notice an extra DataLoader with a batch size of 1 and wonder why the hell we need it. I’ll get to that in a bit.

from torch.utils.data import TensorDataset, DataLoader

batch_size = 64

train_features = torch.Tensor(X_train_arr)
train_targets = torch.Tensor(y_train_arr)
val_features = torch.Tensor(X_val_arr)
val_targets = torch.Tensor(y_val_arr)
test_features = torch.Tensor(X_test_arr)
test_targets = torch.Tensor(y_test_arr)

train = TensorDataset(train_features, train_targets)
val = TensorDataset(val_features, val_targets)
test = TensorDataset(test_features, test_targets)

train_loader = DataLoader(train, batch_size=batch_size, shuffle=False, drop_last=True)
val_loader = DataLoader(val, batch_size=batch_size, shuffle=False, drop_last=True)
test_loader = DataLoader(test, batch_size=batch_size, shuffle=False, drop_last=True)
test_loader_one = DataLoader(test, batch_size=1, shuffle=False, drop_last=True)
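To see what the loaders actually yield, here is a standalone toy example (synthetic tensors, not the PJME data) showing how batch_size and drop_last shape the batches:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.arange(20, dtype=torch.float32).reshape(10, 2)  # 10 samples, 2 features
targets = torch.arange(10, dtype=torch.float32).reshape(10, 1)

loader = DataLoader(TensorDataset(features, targets), batch_size=4, shuffle=False, drop_last=True)
shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
# drop_last=True discards the final partial batch: 10 samples yield 2 full batches of 4
```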

Defining the RNN model classes

I don’t think I can ever do justice to RNNs if I try to explain the nitty-gritty of how they work in just a few sentences here. Fortunately, there are several well-written articles on these networks for those who are looking for a place to start, Andrej Karpathy’s The Unreasonable Effectiveness of Recurrent Neural Networks, Chris Olah’s Understanding LSTM networks, and Michael Phi’s Illustrated Guide to LSTM’s and GRU’s: A step by step explanation are a few that come to mind.

Unlike humans, traditional neural networks have no memory of what came before; they start from scratch every time they are given a task, pretty much like Leonard, you see. RNNs address this shortcoming. To make a gross oversimplification, they do so by looping information from one step of the network to the next, allowing it to persist within the network. This makes them a strong candidate for various problems involving sequential data, such as speech recognition, language translation, or time-series forecasting, as we will see in a bit.

Vanilla RNN

By extending PyTorch’s nn.Module, the base class for all neural network modules, we define our RNN module as follows. It has one or more RNN layers followed by a fully connected layer that converts the RNN output into the desired output shape. We also define the forward propagation as a class method called forward(), which passes the inputs and a zero-initialized hidden state through the network. PyTorch automatically derives the backpropagation function, backward(), for us.

class RNNModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim, dropout_prob):
        """The __init__ method that initiates an RNN instance.

        Args:
            input_dim (int): The number of nodes in the input layer
            hidden_dim (int): The number of nodes in each layer
            layer_dim (int): The number of layers in the network
            output_dim (int): The number of nodes in the output layer
            dropout_prob (float): The probability of nodes being dropped out

        """
        super(RNNModel, self).__init__()

        # Defining the number of layers and the nodes in each layer
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim

        # RNN layers
        self.rnn = nn.RNN(
            input_dim, hidden_dim, layer_dim, batch_first=True, dropout=dropout_prob
        )
        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        """The forward method takes input tensor x and does forward propagation

        Args:
            x (torch.Tensor): The input tensor of the shape (batch size, sequence length, input_dim)

        Returns:
            torch.Tensor: The output tensor of the shape (batch size, output_dim)

        """
        # Initializing hidden state for first input with zeros
        h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).requires_grad_().to(device)

        # Forward propagation by passing in the input and hidden state into the model
        out, h0 = self.rnn(x, h0.detach())

        # Select only the output of the last time step
        # so that it can fit into the fully connected layer
        out = out[:, -1, :]

        # Convert the final state to our desired output shape (batch_size, output_dim)
        out = self.fc(out)
        return out
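To build intuition for the tensor shapes flowing through forward(), here is a standalone sketch using nn.RNN directly, with arbitrary toy dimensions (not the notebook's hyperparameters):

```python
import torch
import torch.nn as nn

batch_size, seq_len, input_dim, hidden_dim, layer_dim = 3, 5, 7, 11, 2  # toy sizes
rnn = nn.RNN(input_dim, hidden_dim, layer_dim, batch_first=True)
fc = nn.Linear(hidden_dim, 1)

x = torch.randn(batch_size, seq_len, input_dim)
h0 = torch.zeros(layer_dim, batch_size, hidden_dim)

out, hn = rnn(x, h0)  # out: (batch_size, seq_len, hidden_dim)
last = out[:, -1, :]  # keep only the last time step
pred = fc(last)       # (batch_size, 1)
```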

Vanilla RNNs have one shortcoming, though. Simple RNNs can connect previous information to the current step as long as the temporal gap between the relevant past information and the present is small. As that gap grows, RNNs become less capable of learning long-term dependencies, a failure commonly attributed to the vanishing gradient problem. This is where the LSTM comes to the rescue.
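A back-of-the-envelope illustration of why the gap matters: if each backward step through time scales the gradient by a factor below one (the 0.9 below is a toy number, not a property of any specific model), the contribution of distant time steps shrinks exponentially:

```python
factor = 0.9  # hypothetical per-step gradient scaling factor
contributions = {steps: factor ** steps for steps in (10, 50, 100)}
# The gradient signal from 100 steps back is orders of magnitude smaller
# than the signal from 10 steps back
```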

Long Short-Term Memory (LSTM)

Long Short-Term Memory, LSTM for short, is a special type of recurrent network capable of learning long-term dependencies and tends to work much better than the standard version on a wide variety of tasks. RNNs on steroids, so to speak.

The main difference from the standard RNN is that, in addition to the hidden state, LSTMs maintain a cell state, which works like a conveyor belt carrying relevant information from earlier steps to later ones. Along the way, new information is added to or removed from the cell state via the input and forget gates, two small neural networks that determine which information is relevant. From the implementation standpoint, you don’t really have to bother with such details: all you need to add is the cell state in your forward() method.

class LSTMModel(nn.Module):
    """LSTMModel class extends nn.Module class and works as a constructor for LSTMs.

       LSTMModel class initiates a LSTM module based on PyTorch's nn.Module class.
       It has only two methods, namely init() and forward(). While the init()
       method initiates the model with the given input parameters, the forward()
       method defines how the forward propagation needs to be calculated.
       Since PyTorch automatically defines back propagation, there is no need
       to define back propagation method.

       Attributes:
           hidden_dim (int): The number of nodes in each layer
           layer_dim (int): The number of layers in the network
           lstm (nn.LSTM): The LSTM model constructed with the input parameters.
           fc (nn.Linear): The fully connected layer to convert the final state
                           of LSTMs to our desired output shape.

    """
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim, dropout_prob):
        """The __init__ method that initiates a LSTM instance.

        Args:
            input_dim (int): The number of nodes in the input layer
            hidden_dim (int): The number of nodes in each layer
            layer_dim (int): The number of layers in the network
            output_dim (int): The number of nodes in the output layer
            dropout_prob (float): The probability of nodes being dropped out

        """
        super(LSTMModel, self).__init__()

        # Defining the number of layers and the nodes in each layer
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim

        # LSTM layers
        self.lstm = nn.LSTM(
            input_dim, hidden_dim, layer_dim, batch_first=True, dropout=dropout_prob
        )

        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        """The forward method takes input tensor x and does forward propagation

        Args:
            x (torch.Tensor): The input tensor of the shape (batch size, sequence length, input_dim)

        Returns:
            torch.Tensor: The output tensor of the shape (batch size, output_dim)

        """
        # Initializing hidden state for first input with zeros
        h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).requires_grad_().to(device)

        # Initializing cell state for first input with zeros
        c0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).requires_grad_().to(device)

        # We need to detach as we are doing truncated backpropagation through time (BPTT)
        # If we don't, we'll backprop all the way to the start even after going through another batch
        # Forward propagation by passing in the input, hidden state, and cell state into the model
        out, (hn, cn) = self.lstm(x, (h0.detach(), c0.detach()))

        # Reshaping the outputs in the shape of (batch_size, seq_length, hidden_size)
        # so that it can fit into the fully connected layer
        out = out[:, -1, :]

        # Convert the final state to our desired output shape (batch_size, output_dim)
        out = self.fc(out)

        return out

Gated Recurrent Unit (GRU)

Gated Recurrent Units (GRUs) are a slightly more streamlined variant that provides comparable performance with considerably faster computation. Like LSTMs, they capture long-term dependencies, but they do so using reset and update gates, without a separate cell state.

While the update gate determines how much of the past information needs to be kept, the reset gate decides how much of it to forget. Because they perform fewer tensor operations, GRUs are often faster and require less memory than LSTMs. As you will see below, the model class is almost identical to the RNN’s.
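To make the two gates concrete, here is a minimal NumPy sketch of a single GRU step for one sample. The weight names are illustrative, biases are omitted, and the exact gate convention varies slightly between references and PyTorch’s internal implementation; treat this as a sketch of the idea, not of `nn.GRU` itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step for a single sample (biases omitted for brevity)."""
    z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate: how much past to keep
    r = sigmoid(Wr @ x + Ur @ h_prev)               # reset gate: how much past to forget
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde           # blend old state and candidate

rng = np.random.default_rng(42)
input_dim, hidden_dim = 3, 4
x = rng.normal(size=input_dim)
h = np.zeros(hidden_dim)
# Hypothetical random weights: (Wz, Uz, Wr, Ur, Wh, Uh) with matching shapes
weights = [rng.normal(size=(hidden_dim, d)) for _ in range(3) for d in (input_dim, hidden_dim)]
h_next = gru_cell(x, h, *weights)
print(h_next.shape)  # (4,)
```

Notice there is no cell state to carry around: the hidden state alone is both the memory and the output, which is where the savings over the LSTM come from.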

class GRUModel(nn.Module):
    """GRUModel class extends nn.Module class and works as a constructor for GRUs.

       GRUModel class initiates a GRU module based on PyTorch's nn.Module class.
       It has only two methods, namely __init__() and forward(). While the __init__()
       method initiates the model with the given input parameters, the forward()
       method defines how forward propagation is calculated. Since PyTorch computes
       gradients automatically, there is no need to define a backpropagation method.

       Attributes:
           hidden_dim (int): The number of nodes in each layer
           layer_dim (int): The number of layers in the network
           gru (nn.GRU): The GRU model constructed with the input parameters.
           fc (nn.Linear): The fully connected layer to convert the final state
                           of GRUs to our desired output shape.

    """
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim, dropout_prob):
        """The __init__ method that initiates a GRU instance.

        Args:
            input_dim (int): The number of nodes in the input layer
            hidden_dim (int): The number of nodes in each layer
            layer_dim (int): The number of layers in the network
            output_dim (int): The number of nodes in the output layer
            dropout_prob (float): The probability of nodes being dropped out

        """
        super(GRUModel, self).__init__()

        # Defining the number of layers and the nodes in each layer
        self.layer_dim = layer_dim
        self.hidden_dim = hidden_dim

        # GRU layers
        self.gru = nn.GRU(
            input_dim, hidden_dim, layer_dim, batch_first=True, dropout=dropout_prob
        )

        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        """The forward method takes input tensor x and does forward propagation

        Args:
            x (torch.Tensor): The input tensor of the shape (batch size, sequence length, input_dim)

        Returns:
            torch.Tensor: The output tensor of the shape (batch size, output_dim)

        """
        # Initializing hidden state for first input with zeros
        h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).requires_grad_().to(device)

        # Forward propagation by passing in the input and hidden state into the model
        out, _ = self.gru(x, h0.detach())

        # Reshaping the outputs in the shape of (batch_size, seq_length, hidden_size)
        # so that it can fit into the fully connected layer
        out = out[:, -1, :]

        # Convert the final state to our desired output shape (batch_size, output_dim)
        out = self.fc(out)

        return out

Similar to the trick we do with scalers, we can also easily switch between these models we just created.

def get_model(model, model_params):
    models = {
        "rnn": RNNModel,
        "lstm": LSTMModel,
        "gru": GRUModel,
    }
    return models.get(model.lower())(**model_params)

Now it seems we have everything ready to train our RNN models. But where do we start?

Making predictions

Let’s start by creating the main framework for training the models. There are probably heaps of ways to do this, and one of them is to use a helper, or a wrapper, class that holds the training, validation, and evaluation methods. First, we need to have a model class, a loss function to calculate the losses, and an optimizer to update the weights in the network.

Helper/Wrapper Class for training

If you’re familiar with neural networks, you already know that training them is a rather repetitive process, looping back and forth between forward-prop and back-prop. I find it useful to have one level of abstraction, a train step function or wrapper, to combine these repetitive steps.

After defining one proper training step, we can now move on to writing the training loop, where this step function is called in each epoch. Each epoch has two stages: training and validation. After each training step, the network’s weights are tweaked a bit to minimize the loss function. Then, the validation stage evaluates the current state of the model to see whether there has been any improvement after the most recent update.

As I’ll be using mini-batch training, a technique where only a portion of the data is used at each weight update, there will be two for loops, one for each stage, in which the model is trained and validated batch by batch. This usually requires reshaping each batch tensor into the correct input dimensions so that the network can use it as an input.

Another important thing to note is to activate train() mode during training and eval() mode during validation. The train() mode enables layers such as dropout, while eval() disables them so that inference is deterministic. Note that eval() by itself does not stop gradients from being calculated; wrapping the validation loop in torch.no_grad() is what skips gradient tracking, and since we never call step() there, the weights stay the same.
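A tiny sketch of the distinction, using a toy network rather than our model:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

net.eval()                      # disables dropout, so repeated calls agree
with torch.no_grad():           # this is what actually skips gradient tracking
    out1 = net(x)
    out2 = net(x)
print(torch.equal(out1, out2))  # True: eval() made the forward pass deterministic
print(out1.requires_grad)       # False: because of no_grad(), not eval()

out3 = net(x)                   # still in eval() mode, but outside no_grad()
print(out3.requires_grad)       # True: eval() alone does not stop gradient tracking
```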

Now, we can finally train our model. However, without evaluating these models on a separate test set, i.e., a hold-out set, it would be impossible to tell how the model performs compared with the other models we’re building. Much like the validation loop in the train() method, we’ll define a testing method to evaluate our models as follows.

During the training, the loss function outputs are generally a good indicator of whether the model is learning, overfitting, or underfitting. For this reason, we’ll be plotting simple loss figures by using the following method.

class Optimization:
    """Optimization is a helper class that allows training, validation, prediction.

    Optimization is a helper class that takes a model, a loss function, and an optimizer
    function as inputs. In return, it provides a framework to train and validate the models,
    and to predict future values based on the models.

    Attributes:
        model (RNNModel, LSTMModel, GRUModel): Model class created for the type of RNN
        loss_fn (torch.nn.modules.Loss): Loss function to calculate the losses
        optimizer (torch.optim.Optimizer): Optimizer function to optimize the loss function
        train_losses (list[float]): The loss values from the training
        val_losses (list[float]): The loss values from the validation
    """
    def __init__(self, model, loss_fn, optimizer):
        """
        Args:
            model (RNNModel, LSTMModel, GRUModel): Model class created for the type of RNN
            loss_fn (torch.nn.modules.Loss): Loss function to calculate the losses
            optimizer (torch.optim.Optimizer): Optimizer function to optimize the loss function
        """
        self.model = model
        self.loss_fn = loss_fn
        self.optimizer = optimizer
        self.train_losses = []
        self.val_losses = []
        
    def train_step(self, x, y):
        """The method train_step completes one step of training.

        Given the features (x) and the target values (y) tensors, the method completes
        one step of the training. First, it activates the train mode to enable back prop.
        After generating predicted values (yhat) by doing forward propagation, it calculates
        the losses by using the loss function. Then, it computes the gradients by doing
        back propagation and updates the weights by calling step() function.

        Args:
            x (torch.Tensor): Tensor for features to train one step
            y (torch.Tensor): Tensor for target values to calculate losses

        Returns:
            float: The loss value for the completed training step

        """
        # Sets model to train mode
        self.model.train()

        # Makes predictions
        yhat = self.model(x)

        # Computes loss (conventional argument order is (prediction, target))
        loss = self.loss_fn(yhat, y)

        # Computes gradients
        loss.backward()

        # Updates parameters and zeroes gradients
        self.optimizer.step()
        self.optimizer.zero_grad()

        # Returns the loss
        return loss.item()

    def train(self, train_loader, val_loader, batch_size=64, n_epochs=50, n_features=1):
        """The method train performs the model training

        The method takes DataLoaders for training and validation datasets, batch size for
        mini-batch training, number of epochs to train, and number of features as inputs.
        Then, it carries out the training by calling the method train_step iteratively
        for n_epochs epochs. Finally, it saves the model's state dictionary to a
        designated file path.

        Args:
            train_loader (torch.utils.data.DataLoader): DataLoader that stores training data
            val_loader (torch.utils.data.DataLoader): DataLoader that stores validation data
            batch_size (int): Batch size for mini-batch training
            n_epochs (int): Number of epochs to train for
            n_features (int): Number of feature columns

        """
        # Class name plus a filesystem-safe timestamp: f'{self.model}' would embed the
        # multi-line module repr in the path, and colons are not allowed on Windows
        model_path = f'{type(self.model).__name__}_{datetime.now().strftime("%Y-%m-%d_%H-%M-%S")}.pt'

        for epoch in range(1, n_epochs + 1):
            batch_losses = []
            for x_batch, y_batch in train_loader:
                # Assumes every batch has exactly batch_size rows (e.g., drop_last=True)
                x_batch = x_batch.view([batch_size, -1, n_features]).to(device)
                y_batch = y_batch.to(device)
                loss = self.train_step(x_batch, y_batch)
                batch_losses.append(loss)
            training_loss = np.mean(batch_losses)
            self.train_losses.append(training_loss)

            with torch.no_grad():
                batch_val_losses = []
                for x_val, y_val in val_loader:
                    x_val = x_val.view([batch_size, -1, n_features]).to(device)
                    y_val = y_val.to(device)
                    self.model.eval()
                    yhat = self.model(x_val)
                    val_loss = self.loss_fn(y_val, yhat).item()
                    batch_val_losses.append(val_loss)
                validation_loss = np.mean(batch_val_losses)
                self.val_losses.append(validation_loss)

            if (epoch <= 10) or (epoch % 50 == 0):
                print(
                    f"[{epoch}/{n_epochs}] Training loss: {training_loss:.4f}\t Validation loss: {validation_loss:.4f}"
                )

        torch.save(self.model.state_dict(), model_path)

    def evaluate(self, test_loader, batch_size=1, n_features=1):
        """The method evaluate performs the model evaluation

        The method takes DataLoaders for the test dataset, batch size for mini-batch testing,
        and number of features as inputs. Similar to the model validation, it iteratively
        predicts the target values and calculates losses. Then, it returns two lists that
        hold the predictions and the actual values.

        Note:
            This method assumes that the prediction from the previous step is available at
            the time of the prediction, and only does one-step prediction into the future.

        Args:
            test_loader (torch.utils.data.DataLoader): DataLoader that stores test data
            batch_size (int): Batch size used during evaluation (defaults to 1 for one-step prediction)
            n_features (int): Number of feature columns

        Returns:
            list[float]: The values predicted by the model
            list[float]: The actual values in the test set.

        """
        with torch.no_grad():
            predictions = []
            values = []
            for x_test, y_test in test_loader:
                x_test = x_test.view([batch_size, -1, n_features]).to(device)
                y_test = y_test.to(device)
                self.model.eval()
                yhat = self.model(x_test)
                predictions.append(yhat.cpu().detach().numpy())
                values.append(y_test.cpu().detach().numpy())

        return predictions, values

    def plot_losses(self):
        """The method plots the calculated loss values for training and validation
        """
        plt.plot(self.train_losses, label="Training loss")
        plt.plot(self.val_losses, label="Validation loss")
        plt.legend()
        plt.title("Losses")
        plt.show()
        plt.close()

Training the model

So far, we have prepared our dataset, defined our model classes and the wrapper class. We need to put all of them together. Without further ado, let’s start training our model.

import torch.optim as optim

input_dim = len(X_train.columns)
output_dim = 1
hidden_dim = 64
layer_dim = 3
batch_size = 64
dropout = 0.2
n_epochs = 50
learning_rate = 1e-3
weight_decay = 1e-6

model_params = {'input_dim': input_dim,
                'hidden_dim' : hidden_dim,
                'layer_dim' : layer_dim,
                'output_dim' : output_dim,
                'dropout_prob' : dropout}

model = get_model('lstm', model_params).to(device)

loss_fn = nn.MSELoss(reduction="mean")
optimizer = optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)


opt = Optimization(model=model, loss_fn=loss_fn, optimizer=optimizer)
opt.train(train_loader, val_loader, batch_size=batch_size, n_epochs=n_epochs, n_features=input_dim)
opt.plot_losses()

predictions, values = opt.evaluate(
    test_loader_one,
    batch_size=1,
    n_features=input_dim
)
[1/50] Training loss: 0.0173	 Validation loss: 0.0190
[2/50] Training loss: 0.0122	 Validation loss: 0.0139
[3/50] Training loss: 0.0106	 Validation loss: 0.0136
[4/50] Training loss: 0.0096	 Validation loss: 0.0184
[5/50] Training loss: 0.0092	 Validation loss: 0.0161
[6/50] Training loss: 0.0089	 Validation loss: 0.0097
[7/50] Training loss: 0.0087	 Validation loss: 0.0099
[8/50] Training loss: 0.0085	 Validation loss: 0.0102
[9/50] Training loss: 0.0084	 Validation loss: 0.0101
[10/50] Training loss: 0.0082	 Validation loss: 0.0096
[50/50] Training loss: 0.0075	 Validation loss: 0.0077
(Figure: training and validation loss curves over the 50 epochs.)

Formatting the predictions

As you may recall, we trained our network with standardized inputs; therefore, all the model’s predictions are also scaled. Also, because our evaluation method works in batches, the predictions come out in batches as well. To calculate error metrics and plot these predictions, we first need to flatten these multi-dimensional tensors into a one-dimensional vector and then apply inverse_transform() to recover the predictions’ real values.

def inverse_transform(scaler, df, columns):
    for col in columns:
        df[col] = scaler.inverse_transform(df[col])
    return df


def format_predictions(predictions, values, df_test, scaler):
    vals = np.concatenate(values, axis=0).ravel()
    preds = np.concatenate(predictions, axis=0).ravel()
    df_result = pd.DataFrame(data={"value": vals, "prediction": preds}, index=df_test.head(len(vals)).index)
    df_result = df_result.sort_index()
    df_result = inverse_transform(scaler, df_result, [["value", "prediction"]])
    return df_result


df_result = format_predictions(predictions, values, X_test, scaler)
df_result
value prediction
Datetime
2015-04-09 15:00:00 32204.0 30085.351562
2015-04-09 16:00:00 32049.0 30082.298828
2015-04-09 17:00:00 32209.0 30092.333984
2015-04-09 18:00:00 32707.0 30150.080078
2015-04-09 19:00:00 33012.0 30265.785156
... ... ...
2018-08-02 20:00:00 44057.0 45049.468750
2018-08-02 21:00:00 43256.0 43704.136719
2018-08-02 22:00:00 41552.0 41946.421875
2018-08-02 23:00:00 38500.0 39456.320312
2018-08-03 00:00:00 35486.0 35729.273438

29074 rows × 2 columns

Calculating error metrics

After flattening and de-scaling the values, we can now calculate error metrics, such as mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def calculate_metrics(df):
    result_metrics = {'mae' : mean_absolute_error(df.value, df.prediction),
                      'rmse' : mean_squared_error(df.value, df.prediction) ** 0.5,
                      'r2' : r2_score(df.value, df.prediction)}
    
    print("Mean Absolute Error:       ", result_metrics["mae"])
    print("Root Mean Squared Error:   ", result_metrics["rmse"])
    print("R^2 Score:                 ", result_metrics["r2"])
    return result_metrics

result_metrics = calculate_metrics(df_result)
Mean Absolute Error:        2955.338
Root Mean Squared Error:    3880.14059023639
R^2 Score:                  0.6424345543424599

Generating baseline predictions

Having some sort of baseline model helps us judge how well our models actually predict. For this task, I’ve chosen good old linear regression: good enough to produce a reasonable baseline, yet simple enough to run very fast.

from sklearn.linear_model import LinearRegression

def build_baseline_model(df, test_ratio, target_col):
    X, y = feature_label_split(df, target_col)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_ratio, shuffle=False
    )
    model = LinearRegression()
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)

    result = pd.DataFrame(y_test)
    result["prediction"] = prediction
    result = result.sort_index()

    return result

df_baseline = build_baseline_model(df_features, 0.2, 'value')
baseline_metrics = calculate_metrics(df_baseline)
Mean Absolute Error:        3652.5844053105866
Root Mean Squared Error:    4589.279608903664
R^2 Score:                  0.4997931432605698
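To put the two models side by side, here is a quick comparison of the error metrics. The values are hard-coded from the outputs above purely for illustration, so rerun it with your own numbers:

```python
# Metric values copied from the printed outputs above (illustrative only)
lstm_metrics = {"mae": 2955.338, "rmse": 3880.141}
linreg_metrics = {"mae": 3652.584, "rmse": 4589.280}

for name in ("mae", "rmse"):
    gain = 100 * (1 - lstm_metrics[name] / linreg_metrics[name])
    print(f"{name.upper()}: {gain:.1f}% lower than the linear regression baseline")
# MAE: 19.1% lower, RMSE: 15.5% lower
```

So the LSTM clearly beats the baseline on this split, though comparing several seeds and architectures would make the claim more robust.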

Visualizing the predictions

Last but not least, visualizing your results helps you better understand how your model performs and what kinds of features would likely improve it. I’ll be using Plotly again, but feel free to use a package that you are more comfortable with.

import plotly.offline as pyo
import plotly.graph_objs as go
from plotly.offline import iplot


def plot_predictions(df_result, df_baseline):
    data = []
    
    value = go.Scatter(
        x=df_result.index,
        y=df_result.value,
        mode="lines",
        name="values",
        marker=dict(),
        text=df_result.index,
        line=dict(color="rgba(0,0,0, 0.3)"),
    )
    data.append(value)

    baseline = go.Scatter(
        x=df_baseline.index,
        y=df_baseline.prediction,
        mode="lines",
        line={"dash": "dot"},
        name='linear regression',
        marker=dict(),
        text=df_baseline.index,
        opacity=0.8,
    )
    data.append(baseline)
    
    prediction = go.Scatter(
        x=df_result.index,
        y=df_result.prediction,
        mode="lines",
        line={"dash": "dot"},
        name='predictions',
        marker=dict(),
        text=df_result.index,
        opacity=0.8,
    )
    data.append(prediction)
    
    layout = dict(
        title="Predictions vs Actual Values for the dataset",
        xaxis=dict(title="Time", ticklen=5, zeroline=False),
        yaxis=dict(title="Value", ticklen=5, zeroline=False),
    )

    # fig = dict(data=data, layout=layout)
    # iplot(fig)
    fig = go.Figure(data=data, layout=layout)
    fig.show(renderer="colab")
    
    
# Set notebook mode to work in offline
pyo.init_notebook_mode()

plot_predictions(df_result, df_baseline)

Where to next?

I’d like to say that was all, but there is, and certainly will be, more. Deep learning has been one of the most fruitful research areas in machine learning, if not the most. Research on sequential deep learning models is growing and will likely keep growing. You may consider this post the first step in exploring what these techniques have to offer for time-series forecasting.

There are still a few more topics I’d like to write about, such as forecasting future time steps using time-lagged and datetime features, regularization techniques (some of which we have already used in this post), and more advanced deep learning architectures for time series. The list goes on. Let’s hope my motivation to keep going lives up to such ambitions. But for now, I’d say that’s a wrap.